For this project, I found an interesting dataset on Kaggle that lists the proportion of political seats held by women in each country by year. I was curious to see what the distribution of women in power looks like globally, and if it has changed over time. This is the dataset that I used: https://www.kaggle.com/datasets/mathurinache/women-in-power3?select=Viz5_August_Female_Political_Representation.csv
import pandas as pd
import matplotlib.pyplot as plt
women = pd.read_csv('Viz5_August_Female_Political_Representation.csv')
women.head()
 | Country Name | Country Code | Year | Proportion of seats held by women in national parliaments (%) |
---|---|---|---|---|
0 | Albania | ALB | 1997 | NaN |
1 | Albania | ALB | 1998 | NaN |
2 | Albania | ALB | 1999 | 0.051613 |
3 | Albania | ALB | 2000 | 0.051613 |
4 | Albania | ALB | 2001 | 0.057143 |
Above is a snapshot of the data in this dataset. The dataset is global, and although some values are missing (the NaN entries), I would treat it as a population rather than a sample.
I'm interested in the "Proportion of seats held by women in national parliaments (%)" column: its distribution, any outliers, and how it changes over time. This should provide insight into the standing of women in power over time, and I'm hoping the statistical distribution will reveal some nuance behind the overall trend.
First, though, I can see that I'll need to reshape the data, because the years are all in one column and I need them in separate columns. The result of my pivot is shown below.
by_year = women.pivot_table('Proportion of seats held by women in national parliaments (%)', index = 'Country Name',
columns='Year')
by_year.head()
Year | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | ... | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Country Name | |||||||||||||||||||||
Albania | NaN | NaN | 0.051613 | 0.051613 | 0.057143 | 0.057143 | 0.057143 | 0.064286 | 0.071429 | 0.071429 | ... | 0.164286 | 0.157143 | 0.157143 | 0.178571 | 0.200000 | 0.207143 | 0.228571 | 0.278571 | 0.278571 | 0.295082 |
Algeria | 0.031579 | 0.031579 | 0.031579 | 0.034211 | 0.034211 | 0.061697 | 0.061697 | 0.061697 | 0.061697 | 0.061697 | ... | 0.077121 | 0.079692 | 0.316017 | 0.316017 | 0.316017 | 0.316017 | 0.316017 | 0.257576 | 0.257576 | 0.257576 |
Andorra | 0.071429 | 0.071429 | 0.071429 | 0.071429 | 0.142857 | 0.142857 | 0.142857 | 0.142857 | 0.285714 | 0.285714 | ... | 0.357143 | 0.500000 | 0.500000 | 0.500000 | 0.500000 | 0.392857 | 0.321429 | 0.321429 | 0.321429 | 0.464286 |
Angola | 0.095455 | 0.154545 | 0.154545 | 0.154545 | 0.154545 | 0.154545 | 0.154545 | 0.150000 | 0.150000 | 0.150000 | ... | 0.386364 | 0.381818 | 0.340909 | 0.340909 | 0.368182 | 0.368182 | 0.368182 | 0.304545 | 0.304545 | 0.300000 |
Antigua and Barbuda | 0.052632 | 0.052632 | NaN | 0.052632 | 0.052632 | 0.052632 | 0.052632 | 0.105263 | 0.105263 | 0.105263 | ... | 0.105263 | 0.105263 | 0.105263 | 0.105263 | 0.111111 | 0.111111 | 0.111111 | 0.111111 | 0.111111 | 0.111111 |
5 rows × 23 columns
This is what I wanted, but I thought it was too granular, so I decided to select the columns every 5 years ending at the most recent (2019).
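The every-fifth-year selection can also be generated rather than typed by hand. Here is a minimal sketch, assuming the pivoted columns are integer years. The `toy` frame and the `pick_years` helper are hypothetical stand-ins of mine, not part of the notebook.

```python
import pandas as pd

def pick_years(columns, last, step=5, count=5):
    """Return up to `count` years spaced `step` apart, ending at `last`,
    keeping only years that actually exist in `columns`."""
    wanted = [last - step * i for i in range(count)][::-1]
    return [year for year in wanted if year in columns]

# Hypothetical stand-in for by_year: one row, columns 1997..2019
toy = pd.DataFrame([list(range(23))], columns=list(range(1997, 2020)), index=['Example'])

years = pick_years(toy.columns, last=2019)
print(years)
print(toy[years])
```

With the real data, `by_year[pick_years(by_year.columns, last=2019)]` should give the same five columns used in the rest of this analysis.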
The table below summarizes the selected years: count, mean, standard deviation, min, Q1, median, Q3, and max. One thing that stands out is that the standard deviation grows over time. The min stays at 0 while the max rises from about 0.43 in 1999 to about 0.64 in 2014 (dipping slightly to 0.61 in 2019), so the range widens, which is consistent with more variability.
by_year[[1999,2004,2009,2014,2019]].describe()
Year | 1999 | 2004 | 2009 | 2014 | 2019 |
---|---|---|---|---|---|
count | 184.000000 | 213.000000 | 214.000000 | 211.000000 | 214.000000 |
mean | 0.115095 | 0.146699 | 0.178698 | 0.205967 | 0.232317 |
std | 0.081317 | 0.090338 | 0.100370 | 0.109409 | 0.113268 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.056393 | 0.090159 | 0.106012 | 0.132161 | 0.160500 |
50% | 0.107625 | 0.132075 | 0.174006 | 0.195741 | 0.220057 |
75% | 0.160685 | 0.192308 | 0.226250 | 0.260849 | 0.299257 |
max | 0.426934 | 0.487500 | 0.562500 | 0.637500 | 0.612500 |
The remaining statistical measures that aren't included in the describe() function are listed below.
selected = by_year[[1999,2004,2009,2014,2019]]
print('MODE')
print(selected.mode())
print('VARIANCE')
print(selected.var())
print('MEAN ABSOLUTE DEVIATION')
# DataFrame.mad() was removed in pandas 2.0, so compute it directly
print((selected - selected.mean()).abs().mean())
print('RANGE')
proportion_range = selected.max() - selected.min()
print(proportion_range)
MODE
Year  1999  2004  2009  2014  2019
0      0.0   0.0   0.0   0.0  0.20
1      NaN   NaN   NaN   NaN  0.25
VARIANCE
Year
1999    0.006613
2004    0.008161
2009    0.010074
2014    0.011970
2019    0.012830
dtype: float64
MEAN ABSOLUTE DEVIATION
Year
1999    0.061107
2004    0.068598
2009    0.075702
2014    0.082533
2019    0.086442
dtype: float64
RANGE
Year
1999    0.426934
2004    0.487500
2009    0.562500
2014    0.637500
2019    0.612500
dtype: float64
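To make the spread measures concrete, here is a small worked sketch on made-up numbers (the `toy` values are hypothetical, not from the dataset). It computes the mean absolute deviation by hand as the average distance from the mean, next to the variance and standard deviation that pandas reports.

```python
import pandas as pd

# Toy values (hypothetical, not from the dataset)
toy = pd.Series([0.0, 0.10, 0.20, 0.50])

mean = toy.mean()                     # average of the four values
mad = (toy - mean).abs().mean()       # mean absolute deviation from the mean
variance = toy.var()                  # sample variance (ddof=1, the pandas default)
std = toy.std()                       # square root of the sample variance

print('mean:', mean)
print('mean absolute deviation:', mad)
print('variance:', variance)
print('standard deviation:', std)
```

The mean absolute deviation comes out smaller than the standard deviation here, which matches the pattern in the real output (for example 0.061 versus 0.081 in 1999), because squaring gives large deviations more weight.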
import matplotlib.ticker as mtick
by_year[[1999,2004,2009,2014,2019]].boxplot(grid=False, figsize = (12,8), showmeans=True)
plt.ylabel('share of seats held by women', fontsize=14)
plt.title('Proportion of political seats held by women globally', fontsize=16)
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1.0))
plt.show()
After reviewing my boxplot, I noticed that the upper bound grew over time, and the number of outliers shrank. I decided to look at which countries remained outliers in 2019 (below).
q1_2019 = by_year[2019].quantile(0.25)
q3_2019 = by_year[2019].quantile(0.75)
IQR2019 = q3_2019 - q1_2019
print('IQR 2019:', IQR2019)
upper_bound = q3_2019 + (1.5 * IQR2019)
print('Upper Bound 2019:', upper_bound)
lower_bound = q1_2019 - (1.5 * IQR2019)
print('Lower Bound 2019:', lower_bound)
print()
outliers_in_2019 = by_year[2019][by_year[2019] > upper_bound]
print('Outliers in 2019:')
print(outliers_in_2019)
IQR 2019: 0.13875742574249997
Upper Bound 2019: 0.5073935643562499
Lower Bound 2019: -0.04763613861374996

Outliers in 2019:
Country Name
Bolivia    0.530769
Cuba       0.532231
Rwanda     0.612500
Name: 2019, dtype: float64
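The same 1.5 × IQR rule (Tukey's fences) can be wrapped in a function and applied to every year, which would be one way to check how the number of outliers changes over time. This is only a sketch: `tukey_outliers` is a hypothetical helper of mine, and the `toy` frame is made-up data standing in for `by_year`.

```python
import pandas as pd

def tukey_outliers(series, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]; NaN values are never flagged."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series[(series < q1 - k * iqr) | (series > q3 + k * iqr)]

# Hypothetical toy frame standing in for by_year (rows = countries, columns = years)
toy = pd.DataFrame(
    {1999: [0.05, 0.06, 0.07, 0.08, 0.40],
     2019: [0.15, 0.20, 0.25, 0.30, 0.35]},
    index=['A', 'B', 'C', 'D', 'E'],
)

for year in toy.columns:
    flagged = tukey_outliers(toy[year])
    print(year, 'outliers:', list(flagged.index))
```

On the real data, looping over `[1999, 2004, 2009, 2014, 2019]` with `tukey_outliers(by_year[year])` should list the flagged countries for each year.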
by_year[[1999,2004, 2009, 2014, 2019]].hist(grid=False, color='thistle', edgecolor='slategrey', figsize = (12,11),
bins=[0,.05,.10,.15,.20,.25,.30,.35,.40,.45,.50,.55,.60,.65])
plt.show()
After trying the histogram with all 5 years I'd included in my boxplot, I noticed that the middle three don't change much and generally move in the direction of the final one. I decided it would be clearer to look at only the first and last years to see the change over 20 years, shown below.
axes = by_year[[1999,2019]].hist(grid=False, color='thistle', edgecolor='slategrey', figsize = (14,6),
bins=[0,.05,.10,.15,.20,.25,.30,.35,.40,.45,.50,.55,.60,.65])
# hist() returns a 2-D array of Axes (one per year), so label each panel in a loop
for ax, year in zip(axes[0], [1999, 2019]):
    ax.set_xlabel('share of seats held by women', fontsize=12)
    ax.set_ylabel('count of countries', fontsize=12)
    ax.set_title(f'{year} global women in power', fontsize=14)
    ax.xaxis.set_major_formatter(mtick.PercentFormatter(xmax=1.0))
plt.show()
The mean and median are fairly close in this dataset, which you can see on the boxplot (the green lines are the medians, and the green triangles are the means). I would choose the median as the most appropriate measure of center, because the high outliers pull the mean up a bit each year. The mode is not appropriate: in four of the five selected years it is 0, which is not representative of the global picture.
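To put a number on how far the mean is pulled above the median, the means and medians can be copied from the describe() output and subtracted. This sketch only re-uses those printed values; the `summary` frame is my own construction for illustration.

```python
import pandas as pd

# Means and medians (50%) copied from the describe() output for the selected years
summary = pd.DataFrame(
    {'mean':   [0.115095, 0.146699, 0.178698, 0.205967, 0.232317],
     'median': [0.107625, 0.132075, 0.174006, 0.195741, 0.220057]},
    index=[1999, 2004, 2009, 2014, 2019],
)

# A positive gap means the mean sits above the median
summary['gap'] = summary['mean'] - summary['median']
print(summary)
```

The gap is positive in every selected year, at roughly 0.5 to 1.5 percentage points, which is consistent with high values pulling the mean upward.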
My variables did contain outliers, all above the upper bound. The calculated lower bound (Q1 - 1.5 × IQR) is negative in every selected year, which doesn't make sense for percentages, so the effective lower bound is 0 and no country can fall below it.
In 2019, the upper bound for the whiskers was 50.74%, and the lower bound would technically be -4.76%, so in practice it was 0%.
In 1999, the distribution was right-skewed. In 2019 it was unimodal and closer to symmetric, though still slightly right-skewed.
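Skew can also be measured instead of judged by eye from the histogram. Below is a minimal sketch using pandas' `skew()` on made-up numbers (both toy series are hypothetical, not from the dataset): a positive value indicates a longer right tail, and a value near zero indicates rough symmetry.

```python
import pandas as pd

# Hypothetical toy shares, not taken from the dataset
right_skewed = pd.Series([0.02, 0.03, 0.04, 0.05, 0.06, 0.08, 0.30])
symmetric = pd.Series([0.10, 0.15, 0.20, 0.25, 0.30])

print('right-skewed toy:', right_skewed.skew())   # positive: long right tail
print('symmetric toy:', symmetric.skew())         # approximately zero
```

On the real data, `by_year[[1999, 2019]].skew()` would give one number per year to compare against what the histograms show.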
When I looked at the details of the outliers in 2019, I was quite surprised to see that they were Bolivia, Cuba, and Rwanda with the highest share of women in political seats. I would have expected larger, wealthier countries to have the highest shares.
I was pleased to see progress with more women holding political seats over time, but there's still a fair amount of progress to be made, as even in 2019 the most populated bins on the histogram are between 10% and 25%.
I also noticed on the boxplot that Q3 grew more over time than Q1: Q1 rose from about 6% to about 16%, while Q3 rose from about 16% to about 30%. This suggests that some countries are making very little progress, and that countries that already had more female political representation may be more likely to grow it. The quartiles alone can't confirm that, though, because they don't track individual countries.
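The quartile comparison can be checked directly from the describe() output. This sketch copies the printed 25% and 75% rows into a small frame (the `quartiles` frame is my own construction) and computes how much each quartile, and the IQR, changed between 1999 and 2019.

```python
import pandas as pd

# Q1 (25%) and Q3 (75%) copied from the describe() output for the selected years
quartiles = pd.DataFrame(
    {'Q1': [0.056393, 0.090159, 0.106012, 0.132161, 0.160500],
     'Q3': [0.160685, 0.192308, 0.226250, 0.260849, 0.299257]},
    index=[1999, 2004, 2009, 2014, 2019],
)
quartiles['IQR'] = quartiles['Q3'] - quartiles['Q1']

# Change in each measure from the first selected year to the last
growth = quartiles.loc[2019] - quartiles.loc[1999]
print(quartiles)
print(growth)
```

Q3 grows by about 13.9 percentage points against about 10.4 for Q1, so the middle half of countries spreads out by roughly 3.4 points over the 20 years.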